[BugFix] Fix PPL mixed text/keyword field type across wildcard indices (#4659) #5358
opensearch-project#4659)
When querying wildcard indices where a field has conflicting types (text vs keyword) across indices, the Calcite script engine incorrectly uses DOC_VALUE retrieval for all shards. This causes shard failures on shards where the field is text (no doc_values), resulting in silently dropped documents.
Add TextKeywordConflictRule to the index mapping merge pipeline that detects text/keyword conflicts and merges to text WITHOUT keyword subfields. This forces _source retrieval which works universally across all shards regardless of the actual field type.
Signed-off-by: Heng Qian <qianheng@amazon.com>
        return false;
    }
    // Match when one is text and the other is keyword
    if (isTextLike(sourceMapping) && isKeyword(targetMapping)) {
If one field is text with a keyword subfield and the other is keyword, should they match and merge to text only?
Yes, this is intentional. Consider the scenario:
- Index A: field msg is text with a keyword subfield (msg.keyword)
- Index B: field msg is keyword
If the text-with-keyword-subfield wins the merge, toKeywordSubField() returns msg.keyword, leading to DOC_VALUE retrieval on the subfield path msg.keyword. But on Index B's shards, msg is just a keyword field — there is no msg.keyword subfield. DOC_VALUE retrieval for msg.keyword would fail on those shards, producing the same silent data loss as the original bug.
By matching this case and merging to plain text (without keyword subfields), we force _source retrieval, which works universally.
The existing test testMatchTextAndKeyword (line 25) covers the pure text-vs-keyword case, and this isTextLike check intentionally covers text-with-keyword-subfield-vs-keyword as well.
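The matching behaviour described in this reply can be sketched roughly as below. This is a minimal, self-contained illustration and not the PR's actual code: the FieldMapping class, the string type names, and the merge method are hypothetical stand-ins for the plugin's real mapping types; only the names isTextLike and isKeyword come from the diff under review.

```java
import java.util.Map;

public class TextKeywordConflictSketch {

    /** Hypothetical field mapping: a type name plus subfield name -> subfield type. */
    public static final class FieldMapping {
        public final String type;
        public final Map<String, String> subfields;

        public FieldMapping(String type, Map<String, String> subfields) {
            this.type = type;
            this.subfields = subfields;
        }
    }

    /** Text with or without a keyword subfield both count as text-like. */
    public static boolean isTextLike(FieldMapping m) {
        return m != null && "text".equals(m.type);
    }

    public static boolean isKeyword(FieldMapping m) {
        return m != null && "keyword".equals(m.type);
    }

    /** Matches in both directions, so the result does not depend on merge order. */
    public static boolean isMatch(FieldMapping source, FieldMapping target) {
        if (source == null || target == null) {
            return false;
        }
        return (isTextLike(source) && isKeyword(target))
            || (isKeyword(source) && isTextLike(target));
    }

    /** Assumed merge result on conflict: plain text with no subfields at all. */
    public static FieldMapping merge(FieldMapping source, FieldMapping target) {
        return new FieldMapping("text", Map.of());
    }

    public static void main(String[] args) {
        FieldMapping textWithSub = new FieldMapping("text", Map.of("keyword", "keyword"));
        FieldMapping keyword = new FieldMapping("keyword", Map.of());

        System.out.println(isMatch(textWithSub, keyword)); // prints "true"
        System.out.println(isMatch(keyword, textWithSub)); // prints "true"
        // The merged mapping has no msg.keyword-style subfield to resolve.
        System.out.println(merge(textWithSub, keyword).subfields.isEmpty()); // prints "true"
    }
}
```

Because the merged mapping carries no subfields in this sketch, nothing downstream can resolve a msg.keyword path, which is the property the reply relies on to force _source retrieval.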
Q: Curious whether this case could be improved to generate a DSL with full DOC_VALUE retrieval?
Curious whether this case could be improved to generate a DSL with full DOC_VALUE retrieval?
@LantaoJin A text-only field doesn't have doc values.
My question is about the case where one field is text with a keyword subfield and the other is keyword.
Discussed offline: the current implementation is sufficient for the bugfix. Supporting the enhancement above requires additional investigation.
    if (isTextLike(sourceMapping) && isKeyword(targetMapping)) {
        return true;
    }
    if (isKeyword(sourceMapping) && isTextLike(targetMapping)) {
Same reasoning as above — the reverse direction (keyword source, text-like target) also needs to match to ensure we always merge to plain text without keyword subfields.
… ...) Signed-off-by: Heng Qian <qianheng@amazon.com>
Persistent review updated to latest commit 17e7410
Description
When querying wildcard indices where a field has conflicting types (text vs keyword) across indices, the Calcite script engine incorrectly uses DOC_VALUE retrieval for all shards. This causes shard failures on shards where the field is text (no doc_values), resulting in silently dropped documents.
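The failure mode above can be modelled with a small toy program. This is a simplified sketch under stated assumptions, not the plugin's or OpenSearch's real code path: Shard, Retrieval, read, and countMatches are hypothetical names, and a doc-value read on a text shard is modelled as returning null instead of raising an actual shard failure.

```java
import java.util.List;

public class RetrievalSketch {

    public enum Retrieval { DOC_VALUE, SOURCE }

    /** Hypothetical shard: the field's actual type there, plus each document's _source value. */
    public static final class Shard {
        public final String fieldType;
        public final List<String> sourceValues;

        public Shard(String fieldType, List<String> sourceValues) {
            this.fieldType = fieldType;
            this.sourceValues = sourceValues;
        }
    }

    /** Doc values are modelled as present only where the shard maps the field as keyword. */
    public static String read(Shard shard, int doc, Retrieval retrieval) {
        if (retrieval == Retrieval.DOC_VALUE && !"keyword".equals(shard.fieldType)) {
            return null; // text field: no doc_values to read from
        }
        return shard.sourceValues.get(doc);
    }

    /** Counts documents whose field equals expected, using ONE retrieval mode for every shard. */
    public static int countMatches(List<Shard> shards, Retrieval retrieval, String expected) {
        int count = 0;
        for (Shard shard : shards) {
            for (int doc = 0; doc < shard.sourceValues.size(); doc++) {
                if (expected.equals(read(shard, doc, retrieval))) {
                    count++;
                }
            }
        }
        return count;
    }

    public static void main(String[] args) {
        List<Shard> shards = List.of(
            new Shard("keyword", List.of("error", "ok")), // index where the field is keyword
            new Shard("text", List.of("error")));         // index where the field is text

        // DOC_VALUE everywhere: the matching document on the text shard is silently lost.
        System.out.println(countMatches(shards, Retrieval.DOC_VALUE, "error")); // prints 1
        // _source everywhere: both matching documents are found.
        System.out.println(countMatches(shards, Retrieval.SOURCE, "error"));    // prints 2
    }
}
```

In this toy model, the only way to get a correct count across both shards with a single retrieval mode is to read from _source, which is the behaviour the fix is meant to force.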
Root cause: The LatestRule merge strategy non-deterministically picks one type when merging field mappings from multiple indices. When keyword is picked, RexStandardizer.visitInputRef() selects DOC_VALUE retrieval via OpenSearchTextType.toKeywordSubField(). On shards where the field is actually text, doc_values are not available, causing the script to return null and OpenSearch to silently swallow the shard failure.
Fix: Added TextKeywordConflictRule to the index mapping merge pipeline (between DeepMergeRule and LatestRule). When text and keyword types conflict across indices, the rule merges to OpenSearchTextType WITHOUT keyword subfields. This forces _source retrieval, which works universally across all shards regardless of the actual field type.
Related Issues
Resolves #4659
Check List
-s) spotlessCheck passed